At first, we wanted to analyse the impacts of the recent global pandemic on individuals within the context of a variety of mental health conditions, namely anxiety and depression, in a pre- and post-COVID framework. However, we decided to drop the idea due to a lack of data available regarding the topic, which wouldn’t have allowed us to conduct a thorough analysis. As we were searching for new topic ideas, we found a dataset that concerns shark attacks around the world. A group member has then told us about an incident that occurred during his last holidays in Egypt, involving a shark attack at a paradisiac beach. We discussed thereafter the presence of these predators in many different regions in the world including Florida, Hawaii, and Australia just to name a few. For this reason, we decided to dive deeper into the topic of shark attacks and explore potential influencing factors. After a few days of research and information gathering, we found that a wide range of factors are capable of influencing such incidents, both directly and indirectly. We all enjoy going on holidays and enjoy beaches while drinking a cocktail, and when we feel that temperature increases, we go in the water to refresh, but depending on the location we are, this could be unsafe.
Another motivation behind this project is our interest in the preservation of marine fauna. Very often human beings tend to primarily blame the animals for their fierce behavior (just look at the example of the attack that occurred in Hurghada during summer of 2023: after the death of the victim, the shark was killed!) but we often forget that the water it is their natural habitat, and that we are the guests. This point motivates us even more to study the phenomenon of attacks and to disclose how human activity causes shark attacks.
The project’s focus is to provide an understanding of the complex dynamics of shark attacks globally, by conducting an in-depth analysis on the different factors that influence the phenomenon. In addressing the topic at hand, we explore both direct and indirect influencing factors.
At the end of our work we want to learn how it is possible to reduce accidents caused by sharks. On the one hand, this is crucial to guarantee the safety of people who take pleasure in coastal waters and engage in water-related activities. On the other hand, the conservation of aquatic animals might benefit from this. Making efforts to preserve these species is essential for sustaining the health of marine ecosystems in a time when the extinction rate of many animal species is rising. Among the factors that might have an impact on the trend of shark attacks we have climatic variations. Ocean temperatures and currents are capable of affecting migration patterns and habitat preferences, which further complicates the dynamics of shark attacks. By providing a step-by-step analysis on climate-related aspects, we aim to provide empirical evidence on the impact of environmental changes on shark movements and interactions with humans.
This project’s goal is to contribute to the understanding of shark-related incidents and thereby to provide possible strategies such as policymaking, which would help mitigate such occurrences, ultimately safeguarding human life and protecting sharks. In essence, we aim to advance research on sustainable coexistence with the marine environment through actionable insights and intend to contribute to the broader efforts of marine conservation.
In the second section of our work, we will present the datasets that are necessary to answer our research questions. These datasets serve as the foundation for our investigation and provide the data required to draw insightful and important findings. The purpose of this part is to provide a clear overview of the data landscape and make clear to the reader what are the information to be found in each web-based data, and how do they interrelate.
Our second dataset contains the average temperature change for each country (annually, monthly and seasonally) from 1961 to 2022. It contains 9,657 observations.
The dataset was taken from the website of the Food and Agriculture Organization of United Nations: https://www.fao.org/faostat/en/#data/ET
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Area Code", "Area", "Months ", "Month code ", "Element Code ", "Element", "Unit", "Years")
Description <- c("Code for each country", "Country for which the data has been collected", "Month during which the data on temperature has been collected", "Code of each month ", "Binary variable (7271: Temperature change; 6078: Standard Deviation)", "What is the data measuring (either temperature change or standard deviation of data collection)", "Unit of measure of temperature ", "Year of observations" )
Table2 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table2)
| Variable | Description |
|---|---|
| Area Code | Code for each country |
| Area | Country for which the data has been collected |
| Months | Month during which the data on temperature has been collected |
| Month code | Code of each month |
| Element Code | Binary variable (7271: Temperature change; 6078: Standard Deviation) |
| Element | What is the data measuring (either temperature change or standard deviation of data collection) |
| Unit | Unit of measure of temperature |
| Years | Year of observations |
Our third dataset contains information on the global increase of sea level from 1993 to 2021. The following acronims are important to be noted: GMLS:Global mean sea level GIA: Global Isostatic Adjustment, which is the ongoing movement of land once burdened by ice-age glaciers (National Ocean Service, n.d.)
The dataset was taken from the website of the NASA: https://climate.nasa.gov/vital-signs/sea-level/
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Year", "TotalWeightedObservations ", "GMSL_noGIA ", "StdDevGMSL_noGIA ", "SmoothedGSML_noGIA ", "GMSL_GIA", "StdDevGMSL_GIA", "SmoothedGSML_GIA", "SmoothedGSML_GIA_sigremoved")
Description <- c("Year of the observation", "Total Weighted Observations of Sea Level", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA)", "Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA)", "Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA)", "Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed ")
Table3 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table3)
| Variable | Description |
|---|---|
| Year | Year of the observation |
| TotalWeightedObservations | Total Weighted Observations of Sea Level |
| GMSL_noGIA | Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (without considering the GIA) |
| StdDevGMSL_noGIA | Standard Deviation of global mean sea level variation estimate in millimeters (without considering the GIA) |
| SmoothedGSML_noGIA | Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (without considering the GIA) |
| GMSL_GIA | Global mean sea level variation in millimeters, with respect to 20-year TOPEX/Jason collinear mean reference (considering the GIA) |
| StdDevGMSL_GIA | Standard Deviation of global mean sea level variation estimate in millimeters (considering the GIA) |
| SmoothedGSML_GIA | Smoothed (60-day Gaussian type filter) GMSL variation in millimeters (considering the GIA) |
| SmoothedGSML_GIA_sigremoved | Smoothed (60-day Gaussian type filter) Global Mean Sea Level variation in millimeters (considering the GIA) ; annual and semi-annual signal removed |
Our fourth dataset relates to GHG emissions, one of the main concerns linked to global warming. The dataset contains 31.315 observations on annual GHG emissions per country and per continent, but we will focus on the first case only.
The dataset was taken from the website Our World in Data: https://ourworldindata.org/annual-co2-emissions
# Create a 2x2 table. The first column represents the name of variables, the second one the description of each variable
Variables <- c("Entity", "Code", "Year", "Annual CO₂ emissions")
Description <- c("Country examinated", "ISO Code", "Year of the observation", "GHG emissions per year")
Table4 <- data.frame(Variable = Variables, Description = Description)
# The knitr::kable() allows to visualize the table
knitr::kable(Table4)
| Variable | Description |
|---|---|
| Entity | Country examinated |
| Code | ISO Code |
| Year | Year of the observation |
| Annual CO₂ emissions | GHG emissions per year |
The first step of our work is data cleaning. While collecting data for a study, it might happen to have some variables with missing information, worthless data or spelling mistakes In order to run a regression, it is fundamental that each column within a data set has no missing values and that information is written following the same format. This procedure takes the name of data cleaning, and it will cover the whole subsection.
Please be aware that at the beginning of each subsection we only provide general instructions of our cleaning. Through the code, you will get additional in-depth explanations of each step we took.
#We start by selecting the packages that we will need to run our functions
library(tidyverse)
library(knitr)
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
library(countrycode)
library(ggplot2)
library(plotly)
library(stargazer)
library(gtsummary)
library(corrplot)
library(gridExtra)
The first dataset we will examine is the one pertaining to shark attacks. It was quite full of misspellings and missing values, which required a lot of time and a very dense code to be able to clean it.
For our work, we have decided to only study the phenomenon of shark attacks from 1970 forward. In fact, it is possible to observe in the raw dataset that, before to this date, observations had more erroneous information. This leads one to believe that, more likely than not, the information we needed is more accurate if we do not travel too far back in time.
attacks <- read.csv(here::here("data/attacks1.csv"))
attacks <- attacks[1:3406, ]
#We drop the columns that we do not need. The country is enough for our study, so we will not take into consideration the precise areas. The injury is not useful as well, indeed, the fatality column already gives us enough information. The other columns all contain information that are not useful for a research.
attacks <- subset(attacks, select = setdiff(names(attacks), c('X', 'X.1', 'Area', 'Location', 'Injury', 'href', 'href.formula', 'Investigator.or.Source', 'pdf', 'Name', 'Case.Number.1', 'Case.Number.2', 'original.order')))
#____________________________________________________________________________________
#FIRSTLY WE WORK ON THE COUNTRY COLUMN
#Firstly we put everything in CAPS in order to group countries (The problem was only Fiji which was written sometimes in capital letters and sometimes not. The rest was good). When we run a table, we can see that all countries are written in the same way, but we have 6 NA. We decided to investigate them. To do this, used the function "which" to spot where the NA are. We found out that we could get the country of 2 of them because, the column with the precise area war filled. For row 2957 area says "English Channel", meaning that the attack happened in the UK, while for row 3388 the area is "Carribean Sea". As a consequence, we will fill this spot with Jamaica, since it is central to Carribean and has average temperatures similar to sourrounding countries.
#After that, we have investigated the other 4 NA: they don't have anything in common, nor we found any information about those attacks on internet, so we just deleted the rows. Now countries are clean.
attacks$Country <- toupper(attacks$Country)
attacks$Country <- na_if(attacks$Country, "") #We turned blank spots to NA
rows_with_NA <- which(is.na(attacks$Country))
attacks$Country <- gsub("COLUMBIA", "COLOMBIA", attacks$Country, ignore.case = TRUE)
attacks[2957, "Country"] <- "UNITED KINGDOM"
attacks[3388, "Country"] <- "JAMAICA"
attacks$Country <- gsub("UNITED ARAB EMIRATES \\(UAE\\)", "UNITED ARAB EMIRATES", attacks$Country) #We took off the part "\\(UAE\\)" from the United Arab Emirates
attacks <- subset(attacks, !is.na(Country))
#____________________________________________________________________________________
#NOW WE WORK ON THE DATE
#For this column, we will drop both the year (which is already included in another column) and the day, which we consider useless therefore, we will only keep the month
attacks$Date <- gsub("\\bReported\\b", "", attacks$Date, ignore.case = TRUE) #Remove the word "reported" which appeared sometimes
attacks$Date <- gsub("\\bEarly\\b", "", attacks$Date, ignore.case = TRUE)#Remove the word "Early" which appeared sometimes
attacks$Date <- gsub("[[:digit:][:punct:]]", "", attacks$Date) #remove all signs
words_to_keep <- c("Jan", "Feb", "Mar", "Apr","May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Remove words not in the specific list of months (words_to_keep) for each row
attacks$Date <- sapply(attacks$Date, function(sentence) {
gsub(paste0("\\b(?!", paste(words_to_keep, collapse = "|"), "\\b)\\w+\\s*"), "", sentence, perl = TRUE)
})
attacks$Date <- trimws(attacks$Date) #remove white spaces (" Feb" becomes "Feb")
#when we run the table, we see that there are still some problems, that we will correct by hand (Jul Jul, Jula etc.). We run the function "gsub" to correct mistakes.
attacks$Date <- gsub("Jul Jul", "Jul", attacks$Date)
attacks$Date <- gsub("Jula", "Jul", attacks$Date)
attacks$Date <- gsub("Julb", "Jul", attacks$Date)
attacks$Date <- gsub("July", "Jul", attacks$Date)
attacks$Date <- gsub("June", "Jun", attacks$Date)
attacks$Date <- gsub("May Nov", "Aug", attacks$Date)
attacks$Date <- gsub("November", "Nov", attacks$Date)
#Once the spelling mistakes are good, We replace all NA by blank spaces and investigate them. Some of them actually have their month on the column "Case Number". The others don't have anything in common.
rows_with_mistakes <- attacks[attacks$Date == "", ]
attacks$Date[is.na(attacks$Date)] <- ""
attacks <- mutate(attacks, Date = ifelse(Case.Number == "2017.06.05", "Jun", Date))
attacks <- mutate(attacks, Date = ifelse(Case.Number == "2008.01.30", "Jan", Date))
attacks <- mutate(attacks, Date = ifelse(Case.Number == "2001.04.02.b", "Apr", Date))
attacks <- mutate(attacks, Date = ifelse(Case.Number == "1980.12.30", "Dec", Date))
attacks <- mutate(attacks, Date = ifelse(Case.Number == "2012.12.00", "Dec", Date))
#Finally, we remove the few NA that remained because they had no information that would have helped us getting the information.
attacks$Date <- na_if(attacks$Date, "")
attacks <- subset(attacks, !is.na(Date))
date_logic <- c("Jan" = 1, "Feb" = 2, "Mar" = 3, "Apr" = 4, "May" = 5, "Jun" = 6,
"Jul" = 7, "Aug" = 8, "Sep" = 9, "Oct" = 10, "Nov" = 11, "Dec" = 12)
attacks$Date <- as.integer(factor(attacks$Date, levels = names(date_logic)))
#________________________________________________________________________________________________________________________
#NOW WE WORK ON AGE COLUMN
#To clean the column of the age we took off all the letters, we computed by hand the mean age of all victims that had a range for their year and finally transformed all the ages expressed in months to years (Meaning that: age from 10 to 12 became 11, while 18 months old became 1.5)
attacks$Age <- ifelse(attacks$Age == "18 months", "1.5", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "28 & 26", "27", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "18 or 20", "19", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "12 or 13", "12.5", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "46 & 34", "40", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "28, 23 & 30", "27", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "36 & 26", "31", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "8 or 10", "9", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "30 or 36", "33", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "6½", "6.5", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "21 & ?", "21", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "33 or 37", "35", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "23 & 20", "21.5", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "7 & 31", "19", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "32 & 30", "31", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "16 to 18", "27", attacks$Age)
attacks$Age <- ifelse(attacks$Age == "21 or 26", "23.5", attacks$Age)
attacks$Age <- str_replace_all(attacks$Age, "[A-Za-z]", "")
#Delete all letters and signs
attacks$Age <- na_if(attacks$Age, "") #replace blank spots with NA
attacks$Age <- as.numeric(str_extract(attacks$Age, "\\d+\\.?\\d*")) #put age as logic and not char
While cleaning the column of Ages, we realized that NA were too much and we could not delete them all, because that would have meant losing 29,5% of our data. Therefore, we have decided to work in the following way: for each country that presents LESS NA than the total observation, we compute an histogram (Age against frequency) of non-missing data. If the distribution is normal, we substitute the NA with the mean. If it is left skewed, we substitute the NAs with the first quantile, if it is right skewed, we substitute it with the third quantile.
Here is an example with USA and Bahamas. For USA, the distribution of the non-missing victims’ ages is skewed on the left. Therefore we have decided to substitute all the missing values with the first quantile. For Bahamas, it is evident that data follow a normal distribution, therefore we replace NAs with the mean.
#At this point, NA were still too much, so we will work in the following way: for each country that presents LESS NA than the total observation, we compute an histogram (Age against frequency). If the distribution is normal, we substitute the NA with the mean. If it is left skewed, we substitute the NAs with the first quantile, if it is right skewed, we substitute it with the third quantile
#USA#
ages_in_USA <- attacks %>%
filter(Country == "USA") %>%
select(Age)
ages_vector <- ages_in_USA$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
hist(ages_without_na,
col = "skyblue",
xlab = "Ages",
ylab = "Frequency",
main = "Histogram of Ages in USA without NA"
) #We can see that the histogram is skewed, so we will replace the NAs by the first quartile of non-missing values
q.25 <- quantile(ages_without_na, 0.25)
attacks$Age[attacks$Country == "USA" & is.na(attacks$Age)] <- q.25
#BAHAMAS#
ages_in_BAH <- attacks %>%
filter(Country == "BAHAMAS") %>%
select(Age)
ages_vector <- ages_in_BAH$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
hist(ages_without_na,
col = "darkgreen",
xlab = "Ages",
ylab = "Frequency",
main = "Histogram of Ages in Bahamas without NA"
)
mean_BAH = mean(ages_without_na)
attacks$Age[attacks$Country == "BAHAMAS" & is.na(attacks$Age)] <- mean_BAH
#EGYPT#
ages_in_egy <- attacks %>%
filter(Country == "EGYPT") %>%
select(Age)
ages_vector <- ages_in_egy$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_egy = mean(ages_without_na)
attacks$Age[attacks$Country == "EGYPT" & is.na(attacks$Age)] <- mean_egy
#ITALY#
# Extract all ages from individuals who come from Italy
ages_in_italy <- attacks %>%
filter(Country == "ITALY") %>%
select(Age)
# Convert the extracted ages to a vector
ages_vector <- ages_in_italy$Age
# Display or use the 'ages_vector' containing ages from individuals in Italy
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_age = mean(ages_without_na)
# Replace all the NA on Italy with the mean of people attacked in Italy.
attacks$Age[attacks$Country == "ITALY" & is.na(attacks$Age)] <- 37.375
#AUSTRALIA#
ages_in_aus <- attacks %>%
filter(Country == "AUSTRALIA") %>%
select(Age)
ages_vector <- ages_in_aus$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_aus = mean(ages_without_na)
attacks$Age[attacks$Country == "AUSTRALIA" & is.na(attacks$Age)] <- mean_aus
#SOUTH AFRICA#
ages_in_SA <- attacks %>%
filter(Country == "SOUTH AFRICA") %>%
select(Age)
ages_vector <- ages_in_SA$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_SA = mean(ages_without_na)
attacks$Age[attacks$Country == "SOUTH AFRICA" & is.na(attacks$Age)] <- mean_SA
#BRAZIL#
ages_in_BRA <- attacks %>%
filter(Country == "BRAZIL") %>%
select(Age)
ages_vector <- ages_in_BRA$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean_BRA = mean(ages_without_na)
attacks$Age[attacks$Country == "BRAZIL" & is.na(attacks$Age)] <- mean_BRA
#FIJI#
ages <- attacks %>%
filter(Country == "FIJI") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "FIJI" & is.na(attacks$Age)] <- mean
#FRENCH POLYNESIA#
ages <- attacks %>%
filter(Country == "FRENCH POLYNESIA") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "FRENCH POLYNESIA" & is.na(attacks$Age)] <- mean
#HONG KONG#
ages <- attacks %>%
filter(Country == "HONG KONG") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "HONG KONG" & is.na(attacks$Age)] <- mean
#JAPAN#
ages <- attacks %>%
filter(Country == "JAPAN") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "JAPAN" & is.na(attacks$Age)] <- mean
#MEXICO#
ages <- attacks %>%
filter(Country == "MEXICO") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "MEXICO" & is.na(attacks$Age)] <- mean
#MOZAMBIQUE#
ages <- attacks %>%
filter(Country == "MOZAMBIQUE") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "MOZAMBIQUE" & is.na(attacks$Age)] <- mean
#NEW ZELAND#
ages <- attacks %>%
filter(Country == "NEW ZEALAND") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "NEW ZEALAND" & is.na(attacks$Age)] <- mean
#PHILIPPINES#
ages <- attacks %>%
filter(Country == "PHILIPPINES") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "PHILIPPINES" & is.na(attacks$Age)] <- mean
#REUNION#
ages <- attacks %>%
filter(Country == "REUNION") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "REUNION" & is.na(attacks$Age)] <- mean
#NEW CALEDONIA#
ages <- attacks %>%
filter(Country == "NEW CALEDONIA") %>%
select(Age)
ages_vector <- ages$Age
ages_without_na = na.omit(ages_vector)
ages_without_na <- as.numeric(ages_without_na)
#hist(ages_without_na) -> We put this function preceded by an hashtag because we do not want the histogram to be displayed for all countries
mean = mean(ages_without_na)
attacks$Age[attacks$Country == "NEW CALEDONIA" & is.na(attacks$Age)] <- mean
#We fixed those countries that had lots of NA. Now, for the remaining NA we decided not to do anything for a specific reason. Not only we have not cound the real information on the internet, but we also believe that doing the mean for those remaining situations was useless, because these are the cases when NA are either the same amount of total shark attacks (see Antigua with 1 shark attack and 1 NA), or NA are more than half of the total attacks (see Malaysia with 4 total attacks but 3 of them are NA).
count_na_by_country <- attacks %>%
group_by(Country) %>%
summarise(NA_count = sum(is.na(Age)))
attacks <- subset(attacks, !is.na(Age))
#___________________________________________________________________________________________________________________
#NOW WORK ON TIME
#sometimes hours are written in a different format compared to most of the others, which follow the format
#13h30. Since they are not a lot we just replaced them manually
attacks<- mutate(attacks, Time= ifelse(Time=="Possibly same incident as 2000.08.21", "", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h00 -15h00", "14h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="14h30 / 15h30", "15h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h30 / 10h00", "9h45", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h45-11h15", "11h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="Sometime between 06h00 & 08hoo", "7h00", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="18h15-18h30", "18h20", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="09h00 - 09h30", "9h15", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="10h00 -- 11h00", "10h30", Time))
attacks<- mutate(attacks, Time= ifelse(Time=="11h115", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 05h00 and 08h00", "6h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "17h00 or 17h40", "17h20", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "13h345", "13h45", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "<a0> ", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "09h00 -10h00", "9h30", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "2 hours after Opperman", "", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "11h00 / 11h30", "11h15", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "Between 06h00 & 07h20", "6h40", Time))
attacks <- mutate(attacks, Time = ifelse(Time == "30 minutes after 1992.07.08.a", "", Time))
#now, our objective is to classify hours in parts of the day. indeed, it is useless to keep hours as they are
#for a regression because we want times like 7h30 and 7h00 OR 15h00 and 16h00 to be read as the same thing.
# some of the rows already had the part of the day in it, but in order to work easily with all the column, we
#replaced "morning" with 8h00 etc. in this way, we're able to remove all the strings that are useless. finally,
#void rows have been replaced by an NA and hours, which were there as CHAR have been replaced by numeric
attacks$Time <- str_replace_all(attacks$Time, "\\bmorning\\b", "8h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bafternoon\\b", "15h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bevening\\b", "20h00")
attacks$Time <- str_replace_all(attacks$Time, "\\bnight\\b", "23h00")
attacks$Time <- gsub("[^[:digit:]]", "", attacks$Time)
attacks$Time = na_if(attacks$Time, "")
attacks$Time <- as.numeric(attacks$Time)
# since we removed all the letters, hours now are not written as 8h00 or 15h30 but as 800 and 15h00. this is
# great for us, so that we can easily create a function that classifies those numbers as parts of the day. the
#function works this way: if a value is included between 0 and 500 (i.e. midnight and 5a.m.), the value is replaced
#by the word "night" etc.
timeoftheday <- function(time) {
if (!is.na(time)) {
if (time >= 0 && time < 500) {
return("night")
} else if (time >= 500 && time < 1200) {
return("morning")
} else if (time >= 1200 && time < 1730) {
return("afternoon")
} else if (time >= 1730 && time < 2400) {
return("evening")
} else {
return("") # Handle any other cases (if necessary)
}
} else {
return(NA) # Retain NA values
}
}
attacks$Time <- sapply(attacks$Time, timeoftheday)
attacks$Time <- tolower(attacks$Time)
attacks$Time <- na_if(attacks$Time, "")
sum(is.na(attacks$Time)) #this is the only one that still presents 1415 NA. we dont want to delete
#them all coz we'd lose 44% of our data.
table(attacks$Time)#table shows that most of them happen in the afternoon, while 2ns place is owned by
#morning. To confirm the higher frequency of attacks between 8am and 6pm is this artile (link JC found??)
#we explain this by the fact that, naturally, most of people swim during the day. therefore, what we do
#is replacing NA randomly by the proportion of afternoon, morning and evening.
sum(!is.na(attacks$Time))
#create function that substitutes the NA in time with one of the 4 parts of the day, based on their
#proportion presence in our dataset
non_na_time_proportions <- table(attacks$Time) / sum(!is.na(attacks$Time))
# Identify NA positions
position_of_na <- is.na(attacks$Time)
# Generate random values based on proportions
attacks$Time[position_of_na] <- sample(
names(non_na_time_proportions),
sum(position_of_na),
replace = TRUE,
prob = as.vector(non_na_time_proportions)
)
#_____________________________________________________________________________________________
#NOW WE WORK ON SEX
attacks$Sex <- gsub("lli", "", attacks$Sex)
attacks$Sex <- gsub("M ", "M", attacks$Sex)
attacks$Sex <- na_if(attacks$Sex, "")
attacks$Sex <- ifelse(is.na(attacks$Sex), "Unknown", attacks$Sex)
#__________________________________________________________________________________________________________________
#NOW WE WORK ON SHARK SPECIES
#Creation of a variable species where we delete all the numbers that concern the size of the shark
attacks$Species <- str_remove_all(attacks$Species, "[[:digit:]]")
#If there is an empty cell, we will put NA
species = na_if(attacks$Species, "")
#gsub to take off punctuation
species <- gsub("[[:punct:]]", "", species)
#Delete all the numbers followed by cm or m
species <- gsub("\\d+\\s*(cm|m)\\b", "", species)
#Delete word composed by 1 or 2 letters
species <- gsub("\\b\\w{1,2}\\b", "", species)
#Delete all the numbers
species <- gsub("\\d", "", species)
#Delete all unities as lb, kg or l
species <- gsub("\\b(kg|lb|b)\\b", "", species)
#If the word shark appears more than once in each cell, we only write shark once
species <- gsub("\\bshark\\b(.*\\bshark\\b)?", "shark", species, ignore.case = TRUE)
# Rewrite shark species names in one way, and do that for each species
species<- gsub("^.*White shark.*$", "White shark", species)
species<- gsub("^.*whitetip shark.*$", "White shark", species)
species<- gsub("^.*white shark.*$", "White shark", species)
species<- gsub("^.*Wobbegong shark.*$", "Wobbegong shark", species)
species<- gsub("^.*Wobbegong.*$", "Wobbegong shark", species)
species<- gsub("^.*Zambesi shark.*$", "Zambesi shark", species)
species<- gsub("^.*Zambezi shark.*$", "Zambesi shark", species)
species<- gsub("^.*whaler shark.*$", "Whale shark", species)
species<- gsub("^.*whale shark.*$", "Whale shark", species)
species<- gsub("^.*tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*Tiger shark.*$", "Tiger shark", species)
species<- gsub("^.*thresher shark.*$", "Thresher shark", species)
species<- gsub("^.*spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark feet.*$", "Spinner shark", species)
species<- gsub("^.*Spinner shark.*$", "Spinner shark", species)
species<- gsub("^.*spinner blacktip shark.*$", "Spinner shark", species)
species<- gsub("^.*Tawny nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Nurse shark.*$", "Nurse shark", species)
species<- gsub("^.*Mako shark.*$", "Mako shark", species)
species<- gsub("^.*mako shark.*$", "Mako shark", species)
species<- gsub("^.*Lemon shark.*$", "Lemon shark", species)
species<- gsub(".*lemon shark.*", "Lemon shark", species, ignore.case = TRUE)
species<-gsub("^\\s*shark\\s*$", "Unidentified shark", species)
species<- gsub(".*bull.*", "Bull shark", species, ignore.case = TRUE)
species<- gsub(".*blue.*", "Blue shark", species, ignore.case = TRUE)
species<- gsub(".*reef.*", "Reef shark", species, ignore.case = TRUE)
species<- gsub(".*sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("^.*Sand shark.*$", "Sand shark", species)
species<- gsub(".*Sand shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*sandshark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub(".*Sandbar shark*", "Sand shark", species, ignore.case = TRUE)
species<- gsub("juvenile\\s+\\w+", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("Juvenile shark blacktip shark", "Juvenile shark",species, ignore.case = TRUE)
species<- gsub("blacktip\\s+\\w+", "Blacktip shark",species, ignore.case = TRUE)
species <- gsub("\\black\\w*\\b", "Blacktip shark", species, ignore.case = TRUE)
#Replace all occurrences of the word "blacktip" in a column with "Blacktip Shark," even if the word is accompanied by other words before or after
species <- gsub("\\bblacktip\\b", "Blacktip Shark", species, ignore.case = TRUE)
species<- gsub(".*copper shark*", "Copper shark", species, ignore.case = TRUE)
species <- gsub("\\bcow\\b", "Cow", species)
species <- gsub("\\bsilky\\b", "Silky", species)
species <- gsub("\\bsilvertip\\b", "Silvertip", species)
#Condition: if "Hammerhead" followed by other words then replaced by Hammerhead shark
species <- ifelse(grepl("Hammerhead\\s+\\w+", species, ignore.case = TRUE), "Hammerhead shark", species)
species <- ifelse(grepl("Blacktip\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Raggedtooth\\s+\\w+", species, ignore.case = TRUE), "Raggedtooth shark", species)
species <- ifelse(grepl("Porbeagle\\s+\\w+", species, ignore.case = TRUE), "Porbeagle shark", species)
#if the word shark does not appear then NA
species <- ifelse(grepl("shark", species, ignore.case = TRUE), species, NA)
species <- ifelse(grepl("Juvenile\\s+\\w+", species, ignore.case = TRUE), "Juvenile shark", species)
species <- ifelse(grepl("grey\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("greycolored\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("gray\\s+\\w+", species, ignore.case = TRUE), "Grey shark", species)
species <- ifelse(grepl("Broadnose\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("7gill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("sevengill\\s+\\w+", species, ignore.case = TRUE), "Sevengill shark", species)
species <- ifelse(grepl("black\\s+\\w+", species, ignore.case = TRUE), "Blacktip shark", species)
species <- ifelse(grepl("Sand\\s+\\w+", species, ignore.case = TRUE), "Sand shark", species)
species <- ifelse(grepl("Carpet\\s+\\w+", species, ignore.case = TRUE), "Carpet shark", species)
species <- ifelse(grepl("\\bbrown\\b", species, ignore.case = TRUE), "Brown shark", species)
species <- gsub("^frag\\w+", "Unidentified shark", species)
species <- gsub("^shark$", "Unidentified shark", species)
#Function to check if "small" is present in cell
contains_small <- function(text) {
return(grepl("small", text))
}
# Replace cell with "unidentified shark" if "small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)
#Function to check if "Small" is present in cell
contains_small <- function(text) {
return(grepl("Small", text))
}
# Replace cell with "unidentified shark" if "Small" is present
species <- ifelse(sapply(species, contains_small), "Unidentified shark", species)
#Function to check if "sharks" is present in cell
contains_sharks <- function(text) {
return(grepl("sharks", text))
}
# Replace cell with"unidentified shark" if "sharks" is present
species <- ifelse(sapply(species, contains_sharks), "Unidentified shark", species)
#Function to check if "cookie" is present in cell
contains_cookie <- function(text) {
return(grepl("cookie", text))
}
# Replace cell with"Cookiecutter shark" if "cookie" is present
species <- ifelse(sapply(species, contains_cookie), "Cookiecutter shark", species)
#Function to check if "involvement" is present in cell
contains_involvement <- function(text) {
return(grepl("involvement", text))
}
# Replace cell with"No shark" if "involvement" is present
species <- ifelse(sapply(species, contains_involvement), "No shark", species)
#Function to check if "invovlement" is present in cell
contains_invovlement <- function(text) {
return(grepl("invovlement", text))
}
# Replace cell with"No shark" if "invovlement" is present
species <- ifelse(sapply(species, contains_invovlement), "No shark", species)
#Function to check if "Not" is present in cell
contains_Not <- function(text) {
return(grepl("Not", text))
}
# Replace cell with"No shark" if "Not" is present
species <- ifelse(sapply(species, contains_Not), "No shark", species)
#Function to check if "not" is present in cell
contains_not <- function(text) {
return(grepl("not", text))
}
# Replace cell with"No shark" if "not" is present
species <- ifelse(sapply(species, contains_not), "No shark", species)
#Function to check if "Questionable" is present in cell
contains_Questionable <- function(text) {
return(grepl("Questionable", text))
}
# Replace cell with"No shark" if "Questionable" is present
species <- ifelse(sapply(species, contains_Questionable), "No shark", species)
#Function to check if "questionable" is present in cell
contains_questionable <- function(text) {
return(grepl("questionable", text))
}
# Replace cell with"No shark" if "questionable" is present
species <- ifelse(sapply(species, contains_questionable), "No shark", species)
#Function to check if "Salmon" is present in cell
contains_Salmon <- function(text) {
return(grepl("Salmon", text))
}
# Replace cell with"Salmon shark" if "Salmon" is present
species <- ifelse(sapply(species, contains_Salmon), "Salmon shark", species)
#Function to check if "whaler" is present in cell
contains_whaler <- function(text) {
return(grepl("whaler", text))
}
# Replace cell with"Whale shark" if "involvement" is present
species <- ifelse(sapply(species, contains_whaler), "Whale shark", species)
# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(seen|observed|Tooth|tooth|large|killed|captive|female|metre|foot|followed|caused|Said|young|probably|gaffed)", species, ignore.case = TRUE), "Unidentified shark", species)
# Replace the cell with "No shark" if any of the specified words are found
species <- ifelse(grepl("(hoax|No Shark)", species, ignore.case = TRUE), "No shark", species)
# Replace the cell with "Copper shark" if any of the specified words are found
species <- ifelse(grepl("(Copper)", species, ignore.case = TRUE), "Copper shark", species)
# Replace the cell with "Dogfish shark" if any of the specified words are found
species <- ifelse(grepl("(Dog|dogfish)", species, ignore.case = TRUE), "Dogfish shark", species)
# Replace the cell with "Dusky shark" if any of the specified words are found
species <- ifelse(grepl("(Dusky)", species, ignore.case = TRUE), "Dusky shark", species)
# Replace the cell with "Sevengill shark" if any of the specified words are found
species <- ifelse(grepl("(gill)", species, ignore.case = TRUE), "Sevengill shark", species)
# Replace the cell with "Angel shark" if any of the specified words are found
species <- ifelse(grepl("(Angel)", species, ignore.case = TRUE), "Angel shark", species)
# Replace the cell with "Unidentified shark" if any of the specified words are found
species <- ifelse(grepl("(Unidentifies)", species, ignore.case = TRUE), "Sevengill shark", species)
attacks$Species <- species
attacks$Species <- ifelse(is.na(attacks$Species), "Unidentifies shark", attacks$Species)
#_____________________________________________________________________________________________
#ACTIVITY
#when trying to put everything in lower cap, this was the result: Errore in tolower(attacks$Activity) : multibyte string 921 not valid --> therefore, we converted everything in ASCII (=American Standard Code for Information Interchange)
attacks$Activity <- iconv(attacks$Activity, to = "ASCII", sub = " ")
attacks$Activity <- gsub("[^ -~]", "", attacks$Activity)
attacks$Activity <- tolower(attacks$Activity)
#when running a table of all activities, we can see that we can group them in some categories: diving, race, windsurfing, walking, wading, wade fishing, touching, swimming, surfing, surf, standing, spearfishing, snorkeling, scuba diving, playing, paddle, murder, kayaking, kayak, floating, fishing, so that we have standard categories.
attacks$Activity <- gsub("[^A-Za-z ]", "", attacks$Activity)
kept_activities <- c("diving", "race", "windsurfing", "walking", "wading", "wade fishing", "touching", "swimming", "surfing", "surf", "standing", "spearfishing", "snorkeling", "scuba diving", "playing", "paddle", "murder", "kayaking", "kayak", "floating", "fishing")
# We create an expression pattern that matches any of the specific words
pattern <- paste(kept_activities, collapse = "|")
# Extract only the specific words (kept_activities) from the column and replace the rest with an empty string
attacks$Activity <- str_extract(attacks$Activity, paste0("\\b(?:", pattern, ")\\b"))
attacks <- attacks %>%
mutate(Activity = str_replace_all(Activity, "\\bkayaking\\b", "kayak")) %>%
mutate(Activity = str_replace_all(Activity, "\\bsurfing\\b", "surf"))
attacks$Activity[is.na(attacks$Activity)] <- "other" # for all activities that are not part of the list or for NA, we just put Other, since no information is available
#_____________________________________________________________________________________________
#FATAL
#When we run a table for the Fatality column, we see that almost observation follow the same category (Y = yes, N = No, Unknwn). An observation has a 2017, which is probably a typo (which we replace by empty case) while another one has an M. We want to believe that it was a typo as well, and that "M" was typed instead of "N".
attacks$Fatal..Y.N. <- gsub("2017", "", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <- gsub("M", "N", attacks$Fatal..Y.N.)
attacks$Fatal..Y.N. <-na_if(attacks$Fatal..Y.N., "")
#For the rest of NA, we have no information, so we took of these few observations.
attacks <- subset(attacks, !is.na(Fatal..Y.N.))
#Replace Yes by 1, 0 otherwise
attacks$Fatal..Y.N. <- ifelse(attacks$Fatal..Y.N. == "Y", 1, 0)
names(attacks)[names(attacks) == "Fatal..Y.N."] <- "Fatality"
#_____________________________________________________________________________________________
#final check up:
#We run a table for all columns, in order to check if columns have any missing value or if we still have other mistakes.
table(attacks$Date) #this one is fine
table(attacks$Year) #this one is fine
table(attacks$Type) #here we have boating and boat which mean the same thing. Let us group them
attacks$Type <- gsub("Boating", "Boat", attacks$Type)
attacks$Type <- gsub("Boatomg", "Boat", attacks$Type)
table(attacks$Type) #this one is fine
table(attacks$Country)#this one is fine
table(attacks$Activity)#this one is fine
table(attacks$Age)#this one is fine
table(attacks$Fatality)#this one is fine
table(attacks$Species)#this one is fine
#With the following code, we will add a new column which will associate each country with its ISO code. This is a very important step for the cleaning if the enxt sections, because il will allow us to match the different datasets.
# Get ISO country codes
iso_codes <- countrycode(attacks$Country, "country.name", "iso3c")
attacks$ISO_Code <- countrycode(attacks$Country, "country.name", "iso3c")
#Through the following function we can spot NAs. We can see that there are 23 countries with no ISO Code. We will fix them by hand.
rows_with_na <- which(is.na(attacks$ISO_Code))
attacks$ISO_Code[attacks$Country %in% c("ENGLAND", "SCOTLAND", "BRITISH ISLES") & is.na(attacks$ISO_Code)] <- "GB"
attacks$ISO_Code[attacks$Country %in% c("AZORES") & is.na(attacks$ISO_Code)] <- "PRT"
attacks$ISO_Code[attacks$Country %in% c("ST. MAARTIN", "ST. MARTIN") & is.na(attacks$ISO_Code)] <- "MAF"
attacks$ISO_Code[attacks$Country %in% c("OKINAWA") & is.na(attacks$ISO_Code)] <- "JPN"
attacks$ISO_Code[attacks$Country %in% c("MICRONESIA") & is.na(attacks$ISO_Code)] <- "FSM"
attacks <- subset(attacks, !is.na(ISO_Code))
final_attacks_cleaned <- attacks
DT::datatable(attacks, options = list(pageLength = 10))
The main mission for this dataset was to change the entire structure of the dataset. Originally, the dataset had the years as columns (one column for each year), while the rows were represented by the different countries; the temperature observations were presented by month, quarter and year. Furthermore, for each country and each month/quarter/year, the dataset shows both the change in temperatures and the standard deviation.
To proceed with the cleaning, we eliminated the rows corresponding to the standard deviation (we will not work with it), and the monthly and quarterly observations: we are only interested in yearly temperature variations.
We then rotated the dataset so as to have all the years under a single column; we subsequently matched each country and year with its temperature change observation. Finally we added the ISO code to merge with the other datasets.
temperature <- read_xlsx(here::here("data/Temperature.xlsx"))
#R did not read the "°C" on the column Unit, so we changed it manually
temperature <- temperature %>% mutate(Unit = "°C")
#We take off all the columns that we do not need for our study
temperature <- temperature %>% select(-'Area Code (M49)', -'Months Code', -'Element Code')
#We wliminate all columns having an F at the end (For each year there are 2 columns (ex: 1961 - 1961F; 1962 - 1962F). The column with the F at the end of the Year is a column where only the letter E appears for each row. We tried to look for its meaning on the FAO website but we did not succeed. Therefore, we removed it. What is more, it is important to notice that it was in any case unimportant for our work).
columns_to_remove <- grep("F$", names(temperature), value = TRUE)
temperature <- temperature[, !(names(temperature) %in% columns_to_remove)]
# We keep only meteorological year, we don't need to work with monthly and seasonal data.
target_name <- "Meteorological year"
temperature <- temperature[temperature$Months == target_name, ]
target_name2 <- "Temperature change"
temperature <- temperature[temperature$Element == target_name2, ]
# We take off the rest of the columns that we do not need
temperature <- temperature %>% select(-'Area Code', -'Months', -'Element', -'Unit')
#The column of each year starts with an Y (Y1961, Y1961...). We take it off, so that we can easily match data with other datasets.
new_colnames <- gsub("Y", "", colnames(temperature))
colnames(temperature) <- new_colnames
temperature <- na.omit(temperature)
# We transpose the dataset so that columns become rows. With this function we have a single column, where each row corresponds to a year, but countries became the columns now. We will work on this later in the code.
temperature <- t(temperature)
# I want the first row to be the header
colnames(temperature) <- temperature[1, ]
clean_temperature <- temperature[-1, ]
#col names in CAPS
colnames(clean_temperature) <- toupper(colnames(clean_temperature))
library(tidyr)
# Convert the matrix/array "clean_temperatures" to a data frame
temperature <- as.data.frame(clean_temperature)
# Add 'Year' as a separate column
temperature$Year <- rownames(clean_temperature)
# Reshape the data from wide to long format
temperature <- tidyr::pivot_longer(temperature,
cols = -Year,
names_to = "Country",
values_to = "Temperature")
# Reorder columns as per your desired format
temperature <- temperature[c("Year", "Country", "Temperature")]
#As we said at the beginning, we only work with data from 1970
temperature <- temperature %>% filter(Year >= 1970)
temperature$Year <- as.numeric(temperature$Year)
temperature <- temperature %>%
mutate(Country = ifelse(Country == "UNITED STATES OF AMERICA", "USA", Country))
#As for the attacks dataset, we add a column for the ISO code.
temperature$ISO_Code <- countrycode(temperature$Country, "country.name", "iso3c")
temperature$ISO_Code[temperature$Country %in% c("MICRONESIA") & is.na(temperature$ISO_Code)] <- "FSM"
#Now that we put ISO code, we can remove the country column
temperature <- temperature %>% select(-'Country')
final_temperature_cleaned <- temperature
DT::datatable(temperature, options = list(pageLength = 10))
We choose a data set about the sea level change observed during the last years. We start analyzing the sea level changes since 1993 because we did not find any data of years before. In this data set we had few columns but we only kept 2 columns that were the more relevant for us: the year and the GMSL_GIA. This last column is the global mean of sea level including the glacial isostatic adjustement (GIA). We choose the column with GIA and not without because this show the response of solid Earth to the past changes in surface loading by ice and water such as the glaciation and deglaciation.
sealevel <- read.csv(here::here("data/sealevel.csv"))
# keep the 2 columns we are interested to analyze because the other are irrelevant for our project
sealevel <- subset(sealevel, select = c(Year, GMSL_GIA))
#Show the year only the first time by creating a new column called Year2
sealevel$Year2 <- ifelse(duplicated(sealevel$Year), NA, sealevel$Year)
# delete column Year due to the creation of column Year 2
sealevel <- subset(sealevel, select = -Year)
# delete NA because it does not bring anything to our analysis
sealevel <- na.omit(sealevel)
#Change name of column Year
column <- gsub("2","",colnames(sealevel))
colnames(sealevel) <- column
final_sealevel_cleaned <- sealevel
DT::datatable(sealevel, options = list(pageLength = 10))
Finally, the GHG emissions dataset was the easiest to clean. Indeed, we already have a column for the ISO code, so we simply had to filter the years so that the observations began in 1970.
co2 <- read.csv(here::here("data/CO2.csv"))
# Change names of columns in order to have the same columns that in the other datasets
names(co2)[names(co2) == "Entity"] <- "Country"
names(co2)[names(co2) == "Annual.CO..emissions"] <- "Annual CO2 Emissions"
names(co2)[names(co2) == "Code"] <- "ISO_Code"
# keep the 3 columns we are interested to analyze
co2 <- subset(co2, select = c("ISO_Code", "Year", "Annual CO2 Emissions"))
# only keep information starting in 1970 because we want to look at the evolution of
# the last 50 years
co2<- co2[co2$Year >= 1970, ]
co2$ISO_Code <- na_if(co2$ISO_Code, "")
co2 <- subset(co2, !is.na(ISO_Code))
final_ghg_cleaned <- co2
DT::datatable(co2, options = list(pageLength = 10))
After cleaning the four datasets, we have finally merged them. This step is essential to be able to run regression across variables that came from multiple datasets.
#MERGE FIRST TWO DATA SETS
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data <- left_join(attacks, temperature, by = c("Year", "ISO_Code"))
#MERGE WITH THIRD
# Assuming 'Year' is the columns in both datasets for matching
merged_data2 <- left_join(merged_data, sealevel, by = c("Year"))
#MERGE WITH FOURTH
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
merged_data3$Temperature <- as.numeric(as.character(merged_data3$Temperature))
# Assuming 'Year' and 'Country' are the columns in both datasets for matching
merged_data3 <- left_join(merged_data2, co2, by = c("Year", "ISO_Code"))
#Change name of a column
colnames(merged_data3)[colnames(merged_data3) == 'Country.x'] <- 'Country'
merged_data3$Temperature <- as.numeric(merged_data3$Temperature)
#for merged_data3, which we will use for our regressions, we need numerical variables
#therefore, we will make changes in some categorical ones
#TYPE
#Let's transform the Type variable on numeric. 1 corresponds to Boat, 2 to Unprovoked, Invalid to 3
# Provoked to 4, Questionable to 5 and Sea Disaster to 6
categories <- c("Boat" = 1, "Unprovoked" = 2, "Invalid" = 3, "Provoked" = 4, "Questionable" = 5, "Sea Disaster" = 6)
merged_data3$Type <- factor(merged_data3$Type, levels = names(categories))
merged_data3$Type <- as.numeric(merged_data3$Type)
#TIME
# Let's focus on the transformation of time where morning correspond to 0, afternoon to 1,
# evening to 2 and night to 3
merged_data3$Time <- as.numeric(factor(merged_data3$Time, levels = c("morning", "afternoon", "evening", "night")))
#SEX
merged_data3$Sex <- ifelse(merged_data3$Sex == "M", 0,
ifelse(merged_data3$Sex == "F", 1, 2))
#SPECIES
merged_data3 <- merged_data3 %>%
mutate(Species = as.numeric(factor(Species)))
# Get the top 5 species
top_species <- merged_data3 %>%
count(Species, sort = TRUE) %>%
slice_head(n = 5)
# Create a logical condition to include only the top 5 species
condition <- merged_data3$Species %in% top_species$Species
# Subset the data based on the condition
species_filtered_data <- merged_data3[condition, ]
DT::datatable(merged_data3, options = list(pageLength = 10))
With the y-axis showing the frequency of shark attacks and the x-axis showing the passage of time, the graph below displays a striking pattern in shark attacks over time. The data exhibits a recognizable pattern that indicates a steady increase in the quantity of shark attacks within the designated period. This rising tendency begs critical concerns regarding what is causing such an increase and necessitates more research into ecological, environmental, and human-related elements that could be involved in this occurrence. The graph emphasizes the need of keeping an eye on and comprehending the dynamics of these episodes in addition to providing a visual depiction of the rise in shark attacks.
This trend is a key research topic: why are marine attacks increasing year after year? What could be the reasons? In fact, the reasons can be various, for example of an ecological, environmental nature or linked to human seaside activity. The graph highlights the need to keep an eye on and understand the dynamics of these episodes, which is exactly what we will do during our study.
Three fundamental areas of this graph should be an element of analysis: 1975-1978; 1988-1990; 2019-2020. In fact, we could ask ourselves, why is there a decrease in attacks in this period? For the last period, the answer is simple: our dataset presents data up to June 2018, for the second half of the year we have no data, which explains the drastic drop in attacks. For the other two periods, we will discover the reasons during a more in-depth analysis
ggplot(data = final_attacks_cleaned, aes(x = Year)) +
geom_smooth(stat = "count", aes(group = 1), color = "blue", size = 1) +
geom_bar(stat = "count", fill = "red") +
labs(title = "Shark Attacks Evolution Over Years", x = "Year", y = "Number of Attacks")
To take a look at how this evolution happens withn males and females, we can run another plot showing that males have always been the sex facing more attacks.
# Create a plot to show the trend of shark attacks throughout the years, including victim's sex (interactive version)
attacks_evolution_sex <- ggplot(data = final_attacks_cleaned, aes(x = Year, fill = Sex)) +
geom_bar(position = "stack", color = "white") +
labs(title = "Shark Attacks Evolution Over Years by Sex",
x = "Year",
y = "Number of Attacks",
fill = "Sex") +
scale_fill_manual(values = c("pink", "blue", "orange"), name = "Sex") +
theme_minimal()
interactive_attacks_evolution_sex <- ggplotly(attacks_evolution_sex)
interactive_attacks_evolution_sex
Pattern of shark attacks frequencies through the day
The bar graph below, where the x-axis indicates four time categories —morning, afternoon, evening, and night— displays the distribution of shark attacks throughout the day. The graph’s most noticeable aspect is how frequently shark attacks occur in the afternoon, followed by the morning, evening and night. This concentration of incidents in the afternoon raises intriguing questions about potential human behavioral factors that could contribute to this temporal pattern.
First and foremost, we believe that the afternoon is when most people choose to go swimming, which contributes to the frequency of shark attacks during this time. Recreational water activities such as swimming, surfing, kayaking, and other water sports are probably more popular during these times. Furthermore, the afternoon is when the seas are the warmest, so it’s intriguing to find out if this aspect also has a significant influence on shark activity. Conversely, the decreased number of assaults that occur at night may be related to less people using the water during that time, but there may be other possible reasons as well (less high temperatures, poorer visibility).
Lastly, we suppose that the medium frequencies of the morning and evening might correspond to transitional periods when shark and human behavior are changing.
bar_colors <- c("#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3")
# Create a sorted table
sorted_table <- table(final_attacks_cleaned$Time)
sorted_table <- sorted_table[order(-sorted_table)]
# Create a bar plot with the sorted data
barplot1 <- barplot(sorted_table,
col = bar_colors,
xlab = "Time of Day", ylab = "Frequency of Attacks",
main = "Frequency of Shark Attacks by Time of Day",
border = "white",
ylim = c(0, 1800),
space = 0.5,
cex.names = 0.8,
font.axis = 2,
beside = TRUE)
# Add text labels with frequencies on the bars
text_labels <- sorted_table
text(x = barplot1, y = sorted_table, labels = text_labels, pos = 3, cex = 0.8, col = "black")
Pattern of shark attacks frequencies per country
A first glance at our plotted data reveals that the United States, Australia, and South Africa are the three primary countries with a significant number of shark attacks.
shark_attacks_by_country <- final_attacks_cleaned %>%
group_by(Country) %>%
summarise(total_attacks = n())
shark_attacks_by_country <- shark_attacks_by_country %>%
mutate(CountryCategory = ifelse(total_attacks >= 30, as.character(Country), "Other"))
# Order the countries by frequency in descending order
shark_attacks_by_country$CountryCategory <- reorder(shark_attacks_by_country$CountryCategory, -shark_attacks_by_country$total_attacks)
# Plot the data
plot2<- ggplot(data = shark_attacks_by_country, aes(x = total_attacks, y = CountryCategory)) +
geom_col(fill = "skyblue") +
labs(title = "Total Shark Attacks in Each Country", x = "Number of Attacks", y = "Country") +
theme_minimal() +
theme(axis.text.y = element_text(hjust = 1)) +
scale_y_discrete(labels = scales::label_wrap(10))
interactive_plot <- ggplotly(plot2)
interactive_plot
Better understanding of the attacks and the sex of the victims
We initially believed that men experienced attacks more frequently, but our surprise grew when we examined this chart. It highlights a substantial disparity in the distribution of attacks between men and women.
# Create a summary table with the count of attacks for each sex
attacks_summary_sex <- final_attacks_cleaned %>%
group_by(Sex) %>%
summarise(NumAttacks = n())
# Define the desired order of levels
sex_order <- c("M", "F", "Unknown")
# Convert the 'Sex' variable to a factor with the specified order
attacks_summary_sex$Sex <- factor(attacks_summary_sex$Sex, levels = sex_order)
# Bar plot to show the distribution of shark attacks by sex
interactive_bar_plot_sex_attacks <- ggplotly(
ggplot(attacks_summary_sex, aes(x = Sex, y = NumAttacks, fill = Sex)) +
geom_bar(stat = "identity", position = "dodge", width = 0.7) +
labs(title = "Distribution of Shark Attacks by Sex",
x = "Sex",
y = "Number of Attacks") +
scale_fill_manual(values = c("blue", "pink", "gray"))
)
# Display the interactive bar plot
interactive_bar_plot_sex_attacks
What are the continents that have the most attacks?
This plot illustrates the regions with the highest number of attacks since 1970, with North America and Australia & Oceania taking the lead.
#Create a new column in attacks that represents the continents
attacks_grouped <- final_attacks_cleaned %>%
mutate(Region = case_when(
ISO_Code %in% c(NA) ~ "Unknown",
ISO_Code %in% c("AFG", "ARM", "AZE", "BHR", "BGD", "BTN", "BOL", "BGR", "BFA", "BDI", "KHM", "CMR", "CPV", "CHN", "CXR", "CYP", "HKG", "IND", "IDN", "IRN", "IRQ", "ISR", "JOR", "JPN", "KAZ", "KWT", "KGZ", "LAO", "LBN", "MAC", "MYS", "MDV", "MNG", "MMR", "NPL", "PRK", "OMN", "PAK", "PSE", "PHL", "QAT", "SAU", "SGP", "KOR", "LKA", "TUR", "SYR", "TWN", "TJK", "THA", "TKM", "ARE", "UZB", "YEM", "VNM") ~ "Asia",
ISO_Code %in% c("ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BWA", "HRV", "CZE", "DNK", "EST", "FIN", "FRO", "FRA", "GEO", "DEU", "GRC", "GRL", "GLP", "HUN", "ISL", "IRL", "ITA", "LVA", "LTU", "LIE", "LUX", "MLT", "MDA", "MNE", "NLD", "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR", "GBR") ~ "Europe",
ISO_Code %in% c("DZA", "AGO", "BEN", "CAF", "TCD", "COM", "COG", "CIV", "COD", "DJI", "TLS", "EGY", "ERI", "SWZ", "ETH", "GNQ", "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "LBY", "MDG", "MWI", "MLI", "MTQ", "MRT", "MUS", "MYT", "MAR", "MOZ", "NAM", "NER", "NGA", "REU", "RWA", "SHN", "SEN", "SYC", "SLE", "STP", "ZAF", "SOM", "SSD", "SDN", "SUR", "TZA", "TGO", "TUN", "UGA", "ZMB", "ZWE") ~ "Africa",
ISO_Code %in% c("AIA", "ATG", "ABW", "BHS", "BMU", "BRB", "BLZ", "VGB", "BRN", "CAN", "CRI", "CUB", "CUW", "DMA", "DOM", "SLV", "GRD", "GTM", "HTI", "HND", "JAM", "MSR", "MEX", "USA", "NIC", "PAN", "KNA", "LCA", "SPM", "VCT", "SXM", "TCA", "TTO") ~ "North America",
ISO_Code %in% c("ARG", "COL", "CHL", "ECU", "GUF", "GUY", "PRY", "PER", "URY", "VEN", "BRA", "BES") ~ "South America",
ISO_Code %in% c("AUS", "COK", "FJI", "PYF", "KIR", "MHL", "FSM", "NRU", "NCL", "NZL", "NIU", "PLW", "PNG", "WSM", "SLB", "TON", "TUV", "VUT", "WLF") ~ "Australia & Oceania",
ISO_Code %in% c("ATA") ~ "Antarctica"
))
# Filter data to include only valid regions (excluding "Unknown")
filtered_attacks_region <- attacks_grouped %>%
filter(!is.na(Region) & Region != "Unknown")
# Bar plot to show the distribution of shark attacks by region
interactive_bar_plot_region_attacks <- ggplotly(
ggplot(filtered_attacks_region, aes(x = Region, fill = Region)) +
geom_bar() +
labs(title = "Distribution of Shark Attacks by Region",
x = "Region",
y = "Number of Attacks") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
)
# Display the interactive bar plot
interactive_bar_plot_region_attacks
Fatalities by type of shark
As we can see in this graph, most of attacks were not fatal, meaning that sharks injured a lot of people without taking their lives away. The types of sharks that did the most of accidents were the following ones: Bull shark, Tiger shark and White shark, with the deadlies being the last one. Why do they commit attacks? Probably because their feeding behavior. Maybe they misidentify their predators attacking people instead of animals like turtles, seals, or sea lions. This is consistent with Bull shark and Tiger shark being repetitevly named as the deadlies sharks in the seas and oeceans (CBS News, 2010). Another relevant information derived from this graph is the big number of unidentified sharks. Out of all the attacks recorded by our dataset, we have almost 1750 Unidentified species. It is such a pity not to be able to have this information, which would have led us to more insightful results.
# Merge the two shark species in the "Species" column
final_attacks_cleaned <- final_attacks_cleaned %>%
mutate(Species = ifelse(Species %in% c("Unidentifies shark", "Unidentified shark"), "Unidentified shark", Species))
# Get the top 10 shark species
top_10_sharks <- head(names(sort(table(final_attacks_cleaned$Species), decreasing = TRUE)), 10)
# Subset the data to include only the top 10 shark species
final_attacks_top_10 <- final_attacks_cleaned[final_attacks_cleaned$Species %in% top_10_sharks, ]
# Create the summary table for the top 10 shark species
summary_table_top_10 <- table(final_attacks_top_10$Species, final_attacks_top_10$Fatality)
# Convert the summary table to a data frame
summary_df_top_10 <- as.data.frame(summary_table_top_10)
# Create the ggplot with the top 10 shark species
plot_top_10 <- ggplot(summary_df_top_10, aes(x = Var1, y = Freq, fill = Var2)) +
geom_bar(stat = "identity") +
labs(
title = "Fatality by Top 10 Types of Shark",
x = "Type of Shark",
y = "Frequency",
fill = "Legend"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_manual(
name = "Was the shark attack fatal?",
values = c("blue", "pink", "yellow"),
labels = c("Yes", "No", "Unknown")
)
# Display the plot
print(plot_top_10)
What is the time of the year that have the most shark attacks?
As illustrated in this graph, there is a tendency for shark attacks to occur more frequently during the summer months compared to the rest of the year. It seems pretty logical as there is more people in the water during summer time.
#Here it is great but I tried to class the months in order but it has not worked yet
# Create a plot to show the repartition of shark attacks per month by sex
attacks_per_months <- ggplot(data = final_attacks_cleaned, aes(x = factor(Date), fill = Sex)) +
geom_bar(position = "stack", color = "white") +
labs(title = "Shark Attacks Repartition by Month and Sex",
x = "Month",
y = "Number of Attacks",
fill = "Sex") +
scale_fill_manual(values = c("pink", "blue", "orange"), name = "Sex") +
theme_minimal()
interactive_attacks_per_months <- ggplotly(attacks_per_months)
interactive_attacks_per_months
The graph below illustrates the change in global temperature throughout the last few decades, showing an overall increase in temperature over the 50-year period. Despite some fluctuations, the upward trend indicates a clear warming pattern, with a notable acceleration following the 1980s and one of the two most significant rises occurring in the recent decade.
# Filter out NA values and non-numeric values in the Temperature column
temperature_data_filtered <- final_temperature_cleaned %>%
filter(!is.na(Temperature), !is.na(as.numeric(Temperature)))
# Convert the Temperature column to numeric
temperature_data_filtered$Temperature <- as.numeric(temperature_data_filtered$Temperature)
# Calculate the mean temperature for each year
world_temperature <- temperature_data_filtered %>%
group_by(Year) %>%
summarise(World_Temperature = mean(Temperature, na.rm = TRUE))
# Create a line plot for the world temperature
plot5 <- ggplot(world_temperature, aes(x = Year, y = World_Temperature)) +
geom_line() +
labs(title = "World Temperature Change Over Years",
x = "Year",
y = "World Temperature Change") +
theme_minimal()
interactive_plot <- ggplotly(plot5)
interactive_plot
This line graph depicts the trend of rising sea levels from around the ’90s until the early 2020s, showing some fluctuations initially, and a notable upward trend thereafter. The early 2000s represent a steady rate of increase, with a minor decrease in the early 2010s. This pattern appears to be somewhat consistent with global temperature changes, and a steepening rise in the last couple of years indicate a complementary relation with the increasing temperature levels.
# Create a line plot
plot6 <- ggplot(data = final_sealevel_cleaned, aes(x = Year, y = GMSL_GIA)) +
geom_line(color = "green", size = 1.5) +
labs(title = "Evolution of Sea Level Over the Years", x = "Year", y = "Sea Level")
interactive_plot <- ggplotly(plot6)
interactive_plot
Trend of CO2 emissions per year
The bar graph reveals an evident increase of emissions in the depicted period, as indicated by a consistent upward trend with few fluctuations and no significant declines. The rise reflects the growth of industrialisation over the years, especially marked by the end of the Cold War and subsequent economic growth in many parts of the world. The early 2000s show a particularly sharp increase, which is likely to be as a result of the rapid increase in manufacturing and energy production in Asian regions. Post-2010 years show some fluctuations that could be attributed to emissions mitigation efforts, but the trend overall remains the same.
# Create a line plot with backticks to show the trend in co2 emissions throughout the years
plot7 <- ggplot(data = final_ghg_cleaned, aes(x = Year, y = `Annual CO2 Emissions`)) +
geom_line(color = "blue") +
labs(title = "Trend of CO2 Emissions Over the Years", x = "Year", y = "Annual CO2 Emissions")
interactive_plot <- ggplotly(plot7)
interactive_plot
Which continent generates more GHG emissions?
We can see that there is a significant disparity in emissions among different continents; Asia shows a sharp increase in emissions over the years, particularly accelerating after 2000, becoming the continent with the highest CO2 emissions. This is perfectly consistent with other data, showing that 5 out of the 6 most polluting countries in the world are in Asian, namely China, India, Japan, Russia and Iran. (Statista, 2021) Emissions in Europe depict a gradual rise until the early 90’s, with a noticeable fluctuation and decline thereafter. The North American region follows a similar trend to Europe with a steady increase until the 2000s, followed by a levelling off and a decline. On the other hand, CO2 emissions from South America show a steady increase over time with less fluctuations as compared to Europe and NA. Africa’s emissions have additionally been increasing at a steady rate over the last few decades, but they remain significantly lower compared to the rest of the continents. Oceania represents the lowest overall emissions among the listed regions, with an increase rate on par with Africa. Antarctica’s emissions are not visible on the graph, which is consistent with expectations as it is a largely uninhabited continent.
co2_grouped <- final_ghg_cleaned %>%
mutate(Region = case_when(
ISO_Code %in% c(NA) ~ "Unknown",
ISO_Code %in% c("AFG", "ARM", "AZE", "BHR", "BGD", "BTN", "BOL", "BGR", "BFA", "BDI", "KHM", "CMR", "CPV", "CHN", "CXR", "CYP", "HKG", "IND", "IDN", "IRN", "IRQ", "ISR", "JOR", "JPN", "KAZ", "KWT", "KGZ", "LAO", "LBN", "MAC", "MYS", "MDV", "MNG", "MMR", "NPL", "PRK", "OMN", "PAK", "PSE", "PHL", "QAT", "SAU", "SGP", "KOR", "LKA", "TUR", "SYR", "TWN", "TJK", "THA", "TKM", "ARE", "UZB", "YEM", "VNM") ~ "Asia",
ISO_Code %in% c("ALB", "AND", "AUT", "BLR", "BEL", "BIH", "BWA", "HRV", "CZE", "DNK", "EST", "FIN", "FRO", "FRA", "GEO", "DEU", "GRC", "GRL", "GLP", "HUN", "ISL", "IRL", "ITA", "LVA", "LTU", "LIE", "LUX", "MLT", "MDA", "MNE", "NLD", "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR", "GBR") ~ "Europe",
ISO_Code %in% c("DZA", "AGO", "BEN", "CAF", "TCD", "COM", "COG", "CIV", "COD", "DJI", "TLS", "EGY", "ERI", "SWZ", "ETH", "GNQ", "GAB", "GMB", "GHA", "GIN", "GNB", "KEN", "LSO", "LBR", "LBY", "MDG", "MWI", "MLI", "MTQ", "MRT", "MUS", "MYT", "MAR", "MOZ", "NAM", "NER", "NGA", "REU", "RWA", "SHN", "SEN", "SYC", "SLE", "STP", "ZAF", "SOM", "SSD", "SDN", "SUR", "TZA", "TGO", "TUN", "UGA", "ZMB", "ZWE") ~ "Africa",
ISO_Code %in% c("AIA", "ATG", "ABW", "BHS", "BMU", "BRB", "BLZ", "VGB", "BRN", "CAN", "CRI", "CUB", "CUW", "DMA", "DOM", "SLV", "GRD", "GTM", "HTI", "HND", "JAM", "MSR", "MEX", "USA", "NIC", "PAN", "KNA", "LCA", "SPM", "VCT", "SXM", "TCA", "TTO") ~ "North America",
ISO_Code %in% c("ARG", "COL", "CHL", "ECU", "GUF", "GUY", "PRY", "PER", "URY", "VEN", "BRA", "BES") ~ "South America",
ISO_Code %in% c("AUS", "COK", "FJI", "PYF", "KIR", "MHL", "FSM", "NRU", "NCL", "NZL", "NIU", "PLW", "PNG", "WSM", "SLB", "TON", "TUV", "VUT", "WLF") ~ "Australia & Oceania",
ISO_Code %in% c("ATA") ~ "Antarctica"
))
# Step 1: Calculate Mean Emissions by Year and Region
mean_emissions <- co2_grouped %>%
group_by(Year, Region) %>%
summarize(Mean_Emissions = mean(`Annual CO2 Emissions`, na.rm = TRUE))
# Now we get rid of that Unknown Region to have a better view of the other Regions
# Step 1: Calculate Mean Emissions by Year and Region and getting rid of the things we don't want
mean_emissions <- co2_grouped %>%
filter(!is.na(Region) & !is.na(`ISO_Code`) & Region != "Unknown") %>%
group_by(Year, Region) %>%
summarize(Mean_Emissions = mean(`Annual CO2 Emissions`, na.rm = TRUE))
# Step 2: Let's plot the Data
plot2 <- ggplot(mean_emissions, aes(x = Year, y = Mean_Emissions, color = Region)) +
geom_line() +
labs(title = "CO2 Emissions by Continent",
x = "Year",
y = "Annual CO2 Emissions") +
theme_minimal()
interactive_plot <- ggplotly(plot2)
interactive_plot
What are the countries with the most CO2 emissions?
Here is a plot to further illustrate the distribution of CO2 emissions in the world; the top 30 countries with the most shark attacks since 1970. One interesting thing is that most countries are European countries. However, without surprise, the two biggest pollutant name of the world are the United States of America and China.
# Filter out 'OWID_WRL' from the data
co2_filtered_wrl <- final_ghg_cleaned %>% filter(ISO_Code != "OWID_WRL")
# Select the top 30 countries based on CO2 emissions
top30_countries <- co2_filtered_wrl %>%
group_by(ISO_Code) %>%
summarize(`Annual CO2 Emissions` = sum(`Annual CO2 Emissions`)) %>%
arrange(desc(`Annual CO2 Emissions`)) %>%
top_n(30)
#> Selecting by Annual CO2 Emissions
# Create an interactive bar plot
interactive_top30_countries_co2 <- ggplotly(
ggplot(data = top30_countries, aes(x = reorder(ISO_Code, -`Annual CO2 Emissions`), y = `Annual CO2 Emissions`)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(title = "Top 30 Countries with the Most CO2 Emissions", x = "Countries", y = "Annual CO2 Emissions") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
# Display the interactive bar plot
interactive_top30_countries_co2
Now that we have a better understanding of our variables, we can analyse their relationships.
How is the relationship between the attacks and the age of the victims?
There exists a negative correlation between the age of the victim and the frequency of attacks. This can be attributed to factors such as the higher water engagement of teens and adults below 40 compared to older individuals, leading to a greater involvement in water-related activities.
# Create a summary table with the count of attacks for each age
attacks_summary <- final_attacks_cleaned %>%
group_by(Age) %>%
summarise(NumAttacks = n())
# Scatter plot to show the relationship between the number of shark attacks and age
interactive_attacks_age <- ggplotly(
ggplot(attacks_summary, aes(x = Age, y = NumAttacks)) +
geom_point(color = "blue", size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear trend line
labs(title = "Relationship between Shark Attacks and Age",
x = "Age",
y = "Number of Attacks")
)
#> `geom_smooth()` using formula = 'y ~ x'
interactive_attacks_age
Does sea level and the number of shark attacks have a relationship?
We can discern a positive correlation between the increase in sea levels and the occurrence of shark attacks. This observation is particularly intriguing since we commonly understand that rising sea levels result from global warming. Consequently, it raises the possibility that human influence may be a more significant factor in the surge of shark attacks than previously acknowledged.
# Calculate the number of attacks per year
attacks_summary <- final_attacks_cleaned %>%
group_by(Year) %>%
summarize(Number_of_Attacks = n())
# Assuming there's a common column named "Year" in both datasets
merged_data5 <- merge(attacks_summary, sealevel, by = "Year", all.x = TRUE)
correlation_coefficient <- cor(merged_data5$Number_of_Attacks, merged_data5$GMSL_GIA, use = "complete.obs")
# Scatter plot
interactive_scatter_plot_sea_level_attacks <- ggplotly(
ggplot(merged_data5, aes(x = GMSL_GIA, y = Number_of_Attacks)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear trend line
labs(title = "Relationship Between Sea Level and Shark Attacks",
x = "Global Mean Sea Level Global Isostatic Adjustment (GMSL_GIA)",
y = "Number of Attacks")
)
#> `geom_smooth()` using formula = 'y ~ x'
# Display the interactive scatter plot
interactive_scatter_plot_sea_level_attacks
How is the relationship between temperature change and the number of shark attacks?
In addition to the previous plot, it is evident that there is a positive correlation between global temperatures and the frequency of shark attacks. This further strengthens the earlier argument, suggesting potentially a stronger human responsibility for these attacks.
# Identify non-numeric values in the Temperature column
non_numeric_temp <- final_temperature_cleaned %>%
filter(!is.numeric(Temperature)) %>%
distinct(Temperature)
# Convert "Temperature" column to numeric
final_temperature_cleaned$Temperature <- as.numeric(final_temperature_cleaned$Temperature)
# Check for negative values
negative_values <- final_temperature_cleaned %>%
filter(Temperature < 0)
# Calculate the mean temperature
mean_temp_world <- final_temperature_cleaned %>%
group_by(Year) %>%
summarize(Mean_Temperature = mean(Temperature, na.rm = TRUE))
# Filter out non-numeric values in the Temperature column
temperature2 <- final_temperature_cleaned %>%
filter(is.numeric(Temperature))
# Calculate mean temperature for the world per year
mean_temp_world <- temperature2 %>%
group_by(Year) %>%
summarize(Mean_Temperature = mean(Temperature, na.rm = TRUE))
# Merge datasets
merged_data6 <- merge(attacks_summary, mean_temp_world, by = "Year", all.x = TRUE)
# Plotting
interactive_scatter_plot_worldtemperature_attacks <- ggplotly(
ggplot(merged_data6, aes(x = Mean_Temperature, y = Number_of_Attacks)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear trend line
labs(title = "Relationship between Shark Attacks and World Mean Temperature Change",
x = "World Mean Temperature Change",
y = "Number of Shark Attacks") +
theme_minimal()
)
#> `geom_smooth()` using formula = 'y ~ x'
interactive_scatter_plot_worldtemperature_attacks
#Interactive map
#Import the data concerning the map
map <- read.csv(here::here("data/map.csv"))
map <- map[c('latitude', 'longitude', 'country')]
#Let's rename the columns of the dataset
colnames(map)[colnames(map) == 'latitude'] <- 'lat'
colnames(map)[colnames(map) == 'longitude'] <- 'lng'
colnames(map)[colnames(map) == 'country'] <- 'Country'
map$Country <- toupper(map$Country)
map$Country <- ifelse(map$Country == "UNITED STATES", "USA", map$Country)
merged_map <- merge(merged_data3, map, by = "Country", all = FALSE)
# Create a new variable representing the nb of attacks per country
results <- merged_map %>%
group_by(Country) %>%
summarise(Attackscountry = n())
print(results)
#> # A tibble: 69 x 2
#> Country Attackscountry
#> <chr> <int>
#> 1 ARUBA 1
#> 2 AUSTRALIA 520
#> 3 BAHAMAS 77
#> 4 BRAZIL 96
#> 5 CANADA 1
#> 6 CHILE 1
#> 7 CHINA 4
#> 8 COLOMBIA 3
#> 9 COSTA RICA 5
#> 10 CROATIA 3
#> # i 59 more rows
#Attach aggregated data to your original dataframe
merged_map <- left_join(merged_map, results, by = "Country")
# Definition of the thresholds for the categorization
seuils <- c(0, 50, 100, 500, Inf)
# Definition of the colors we want to have in the map
#couleurs <- c("#4DA6FF", "#0074CC", "#6C8EBF", "#001F3F80")
couleurs <- c("green", "yellow", "orange", "red")
# Add a new category column based on thresholds
merged_map$cat_attacks <- cut(merged_map$Attackscountry, breaks = seuils, labels = FALSE)
merged_map$echelle_taille <- merged_map$Attackscountry * 0.1
# Let's have fun with an interactive map
ma_carte <- leaflet(merged_map) %>%
setView(lng = -95, lat = 37, zoom = 2) %>%
addTiles() %>%
addCircleMarkers(
lng = ~lng,
lat = ~lat,
radius = ~sqrt(echelle_taille) * 2,
color = ~factor(merged_map$cat_attacks, labels = couleurs),
fillOpacity = 0.4,
label = ~paste(Country, ":", Attackscountry),
options = markerOptions(autoPopup = TRUE)
) %>%
addLegend(
position = "bottomleft",
colors = couleurs,
labels = c("Less than 50", "50 to 100", "100 to 500", "More than 500"),
title = "Number of shark attacks"
)
# Personnaliser le style CSS pour la carte
ma_carte$dependencies[[1]]$stylesheet <- "leaflet.css"
# Ajouter du style personnalisé
ma_carte$dependencies[[1]]$inline <- TRUE
ma_carte$dependencies[[1]]$script <- "$('#map').css({'width': '80%', 'height': '300px', 'float': 'left'});"
# Let's see the map
ma_carte
After a deep Exploratory Data Analysis we thought about the relevance of some factors that might influence directly the occurrence of shark attacks. Directly in the sense that these are variables that were found in the main data set and not in a secondary one. Let’s begin then:
Before doing any regression, we will compute the correlation matrix in order to analyze our numerical variables
data_RQ1 <- merged_map
data_RQ1 <- na.omit(data_RQ1)
#Before doing the regression, we will do a correlation matrix of the numerical variables of the main data
numeric_columns <- sapply(data_RQ1, is.numeric)
#Let's choose the numeric columns only
numeric_data <- data_RQ1[, numeric_columns]
#We will only keep the variables we are interested in
selected_columns <- c("Year", "Type", "Sex", "Age", "Time", "Species")
data_selected <- data_RQ1[, selected_columns]
# Calculate the correlation matrix
correlation_matrix <- cor(data_selected)
corrplot(
correlation_matrix,
method = "ellipse",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "gray"
)
Thanks to this correlation matrix, we can affirm that: - Year has a slight positive correlation (0.13) with Age. The older the individual, there is a slight increase in the likelihood of being tied up by a shark, probably due to the activity the individual was doing such as surfing or swimming over time. In another side, there is a negative correlation (-0.1) with Species. - Age has a slightly positive correlation (0.11) with Time almost. - The variable Type has a slight positive correlation with Age and Species while it has a slight negative correlation with Year, Species and Sex. - There is a negative correlation between Year and Species. - The following variables have no significant correlation with the other numeric variables, due to the value of the coefficient close to 0: Sex and Type.
The regression that we will use is the following one:
\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \epsilon \]
where:
Let’s do our main regression based on all the countries
# Let's do an overall multiple regression model
model <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time, data = merged_map)
original_scipen <- options("scipen")
options(scipen = 1000)
stargazer(model,
title = 'Regression on the main factors influencing shark attacks',
type = 'html',
digits = 3)
| Dependent variable: | |
| Attackscountry | |
| Year | 3.710*** |
| (0.879) | |
| Sex | -34.200* |
| (20.100) | |
| Age | -9.540*** |
| (0.856) | |
| Species | -2.020 |
| (1.330) | |
| Type | -46.000*** |
| (15.100) | |
| Time | 8.730 |
| (16.800) | |
| Constant | -6,120.000*** |
| (1,764.000) | |
| Observations | 2,901 |
| R2 | 0.051 |
| Adjusted R2 | 0.049 |
| Residual Std. Error | 603.000 (df = 2894) |
| F Statistic | 25.900*** (df = 6; 2894) |
| Note: | p<0.1; p<0.05; p<0.01 |
This first regression will be based only on the USA
# USA Regression
#This first regression will be based only on the USA
usa_data <- merged_map %>%
filter(Country == "USA")
#Regression model of the USA
model1 <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time, data = usa_data)
# Compute the VIF
vif_values <- car::vif(model1)
VIF values:
| Variable | VIF |
|---|---|
| Year | 1.03 |
| Sex | 1.02 |
| Age | 1.03 |
| Species | 1.03 |
| Type | 1.04 |
| Time | 1.01 |
original_scipen <- options("scipen")
options(scipen = 1000)
stargazer(model1,
title = 'Regression on the main factors influencing shark attacks focused on the USA',
type = 'html',
digits = 3)
| Dependent variable: | |
| Attackscountry | |
| Year | -0.000 |
| (0.000) | |
| Sex | 0.000 |
| (0.000) | |
| Age | 0.000 |
| (0.000) | |
| Species | -0.000 |
| (0.000) | |
| Type | 0.000 |
| (0.000) | |
| Time | -0.000 |
| (0.000) | |
| Constant | 1,479.000*** |
| (0.000) | |
| Observations | 1,479 |
| R2 | 0.500 |
| Adjusted R2 | 0.498 |
| Residual Std. Error | 0.000 (df = 1472) |
| F Statistic | 245.000*** (df = 6; 1472) |
| Note: | p<0.1; p<0.05; p<0.01 |
# Assuming 'usa_data' is your dataset for the USA
# Include the intercept in the initial model
current_model <- lm(Attackscountry ~ 1, data = usa_data)
# List to store models at each step
selected_models <- list()
# Variables to add
variables_to_add <- c("Year", "Sex", "Age", "Species", "Time", "Type")
# Forward selection loop
for (variable in variables_to_add) {
# Add a variable to the formula
formula <- as.formula(paste(". ~ . +", variable))
current_model <- update(current_model, formula)
# Store the current model in the list
selected_models[[variable]] <- current_model
# Calculate AIC
aic_value <- AIC(current_model)
# Print the summary of the current model and AIC value
cat("Variable added:", variable, " | AIC:", aic_value, "\n")
print(summary(current_model))
}
#> Variable added: Year | AIC: -66827
#>
#> Call:
#> lm(formula = Attackscountry ~ Year, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014319 0.0000000000003 0.0000000000012 0.0000000000019
#> Max
#> 0.0000000000025
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001439275 0.0000000001614318
#> Year -0.0000000000000911 0.0000000000000807
#> t value Pr(>|t|)
#> (Intercept) 9161765526064.88 <0.0000000000000002 ***
#> Year -1.13 0.26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1477 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.5
#> F-statistic: 1.48e+03 on 1 and 1477 DF, p-value: <0.0000000000000002
#>
#> Variable added: Sex | AIC: -66826
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014315 0.0000000000002 0.0000000000012 0.0000000000019
#> Max
#> 0.0000000000028
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001498393 0.0000000001618005
#> Year -0.0000000000000943 0.0000000000000809
#> Sex 0.0000000000011236 0.0000000000019330
#> t value Pr(>|t|)
#> (Intercept) 9140887108111.84 <0.0000000000000002 ***
#> Year -1.17 0.24
#> Sex 0.58 0.56
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1476 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 738 on 2 and 1476 DF, p-value: <0.0000000000000002
#>
#> Variable added: Age | AIC: -66824
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014309 0.0000000000000 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000038
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001623448 0.0000000001625604
#> Year -0.0000000000001011 0.0000000000000813
#> Sex 0.0000000000012543 0.0000000000019402
#> Age 0.0000000000000553 0.0000000000000695
#> t value Pr(>|t|)
#> (Intercept) 9098156172803.24 <0.0000000000000002 ***
#> Year -1.24 0.21
#> Sex 0.65 0.52
#> Age 0.80 0.43
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1475 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 492 on 3 and 1475 DF, p-value: <0.0000000000000002
#>
#> Variable added: Species | AIC: -66823
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014307 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000041
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001696208 0.0000000001632457
#> Year -0.0000000000001039 0.0000000000000815
#> Sex 0.0000000000012481 0.0000000000019407
#> Age 0.0000000000000558 0.0000000000000696
#> Species -0.0000000000000577 0.0000000000001121
#> t value Pr(>|t|)
#> (Intercept) 9059961123428.15 <0.0000000000000002 ***
#> Year -1.27 0.20
#> Sex 0.64 0.52
#> Age 0.80 0.42
#> Species -0.51 0.61
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1474 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 368 on 4 and 1474 DF, p-value: <0.0000000000000002
#>
#> Variable added: Time | AIC: -66821
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time,
#> data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014306 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000043
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001691660 0.0000000001633155
#> Year -0.0000000000001034 0.0000000000000816
#> Sex 0.0000000000012477 0.0000000000019414
#> Age 0.0000000000000546 0.0000000000000699
#> Species -0.0000000000000576 0.0000000000001122
#> Time -0.0000000000002559 0.0000000000014858
#> t value Pr(>|t|)
#> (Intercept) 9056089947796.18 <0.0000000000000002 ***
#> Year -1.27 0.21
#> Sex 0.64 0.52
#> Age 0.78 0.43
#> Species -0.51 0.61
#> Time -0.17 0.86
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1473 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 295 on 5 and 1473 DF, p-value: <0.0000000000000002
#>
#> Variable added: Type | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time +
#> Type, data = usa_data)
#>
#> Residuals:
#> Min 1Q Median 3Q
#> -0.0000000014306 -0.0000000000001 0.0000000000011 0.0000000000021
#> Max
#> 0.0000000000043
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 1479.0000000001682565 0.0000000001644287
#> Year -0.0000000000001030 0.0000000000000820
#> Sex 0.0000000000012574 0.0000000000019500
#> Age 0.0000000000000544 0.0000000000000701
#> Species -0.0000000000000567 0.0000000000001135
#> Time -0.0000000000002586 0.0000000000014871
#> Type 0.0000000000000889 0.0000000000016120
#> t value Pr(>|t|)
#> (Intercept) 8994777923269.14 <0.0000000000000002 ***
#> Year -1.26 0.21
#> Sex 0.64 0.52
#> Age 0.78 0.44
#> Species -0.50 0.62
#> Time -0.17 0.86
#> Type 0.06 0.96
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000000000373 on 1472 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 245 on 6 and 1472 DF, p-value: <0.0000000000000002
In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.
In order to have a better model by doing a forward selection where we will take the model with the lowest AIC.
Here we have the AIC values for each model:
As said previously, we chose the model with the lowest AIC indicating a better fit of the model, but we decided to choose it because the model includes more variables than the others, thus best explaining the variability in the data.
This second regression will be based only on Australia
#This second regression will be based only on Australia
aus_data <- merged_map %>%
filter(Country == "AUSTRALIA")
model2 <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time, data = aus_data)
original_scipen <- options("scipen")
options(scipen = 1000)
stargazer(model2,
title = 'Regression on the main factors influencing shark attacks focused on AUSTRALIA',
type = 'html',
digits = 3)
| Dependent variable: | |
| Attackscountry | |
| Year | -0.000 |
| (0.000) | |
| Sex | 0.000 |
| (0.000) | |
| Age | -0.000 |
| (0.000) | |
| Species | -0.000 |
| (0.000) | |
| Type | 0.000 |
| (0.000) | |
| Time | -0.000 |
| (0.000) | |
| Constant | 520.000*** |
| (0.000) | |
| Observations | 520 |
| R2 | 0.500 |
| Adjusted R2 | 0.494 |
| Residual Std. Error | 0.000 (df = 513) |
| F Statistic | 85.400*** (df = 6; 513) |
| Note: | p<0.1; p<0.05; p<0.01 |
# Include the intercept in the initial model
current_model2 <- lm(Attackscountry ~ 1, data = aus_data)
# List to store models at each step
selected_models2 <- list()
# Variables to add
variables_to_add2 <- c("Year", "Sex", "Age", "Species", "Time", "Type")
# Forward selection loop
for (variable in variables_to_add2) {
# Add a variable to the formula
formula <- as.formula(paste(". ~ . +", variable))
current_model2 <- update(current_model2, formula)
# Store the current model in the list
selected_models[[variable]] <- current_model2
# Calculate AIC
aic_value2 <- AIC(current_model2)
# Print the summary of the current model and AIC value
cat("Variable added:", variable, " | AIC:", aic_value, "\n")
print(summary(current_model2))
}
#> Variable added: Year | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012420 0.00000000000023 0.00000000000025
#> 3Q Max
#> 0.00000000000026 0.00000000000027
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.99999999999772626 0.00000000003604512
#> Year -0.00000000000000173 0.00000000000001800
#> t value Pr(>|t|)
#> (Intercept) 14426365221367.6 <0.0000000000000002 ***
#> Year -0.1 0.92
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000546 on 518 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.499
#> F-statistic: 517 on 1 and 518 DF, p-value: <0.0000000000000002
#>
#> Variable added: Sex | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012415 0.00000000000027 0.00000000000029
#> 3Q Max
#> 0.00000000000029 0.00000000000030
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999995338840 0.000000000036493311
#> Year -0.000000000000000546 0.000000000000018213
#> Sex 0.000000000000176122 0.000000000000402815
#> t value Pr(>|t|)
#> (Intercept) 14249186641170.37 <0.0000000000000002 ***
#> Year -0.03 0.98
#> Sex 0.44 0.66
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000547 on 517 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.498
#> F-statistic: 259 on 2 and 517 DF, p-value: <0.0000000000000002
#>
#> Variable added: Age | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012415 0.00000000000027 0.00000000000029
#> 3Q Max
#> 0.00000000000029 0.00000000000030
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999995338840 0.000000000036790730
#> Year -0.000000000000000566 0.000000000000018402
#> Sex 0.000000000000176164 0.000000000000403241
#> Age 0.000000000000000161 0.000000000000020290
#> t value Pr(>|t|)
#> (Intercept) 14133995074177.91 <0.0000000000000002 ***
#> Year -0.03 0.98
#> Sex 0.44 0.66
#> Age 0.01 0.99
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000547 on 516 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.497
#> F-statistic: 172 on 3 and 516 DF, p-value: <0.0000000000000002
#>
#> Variable added: Species | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012414 0.00000000000021 0.00000000000030
#> 3Q Max
#> 0.00000000000031 0.00000000000034
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999996248334 0.000000000037075547
#> Year -0.000000000000000868 0.000000000000018490
#> Sex 0.000000000000180007 0.000000000000404139
#> Age 0.000000000000000389 0.000000000000020345
#> Species -0.000000000000005997 0.000000000000032010
#> t value Pr(>|t|)
#> (Intercept) 14025416977620.41 <0.0000000000000002 ***
#> Year -0.05 0.96
#> Sex 0.45 0.66
#> Age 0.02 0.98
#> Species -0.19 0.85
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 515 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.496
#> F-statistic: 129 on 4 and 515 DF, p-value: <0.0000000000000002
#>
#> Variable added: Time | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time,
#> data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012411 0.00000000000018 0.00000000000024
#> 3Q Max
#> 0.00000000000034 0.00000000000060
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999997271516 0.000000000037225645
#> Year -0.000000000000001263 0.000000000000018538
#> Sex 0.000000000000182530 0.000000000000404543
#> Age -0.000000000000000825 0.000000000000020643
#> Species -0.000000000000006679 0.000000000000032094
#> Time -0.000000000000120797 0.000000000000337369
#> t value Pr(>|t|)
#> (Intercept) 13968864912409.60 <0.0000000000000002 ***
#> Year -0.07 0.95
#> Sex 0.45 0.65
#> Age -0.04 0.97
#> Species -0.21 0.84
#> Time -0.36 0.72
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000548 on 514 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.495
#> F-statistic: 103 on 5 and 514 DF, p-value: <0.0000000000000002
#>
#> Variable added: Type | AIC: -66819
#>
#> Call:
#> lm(formula = Attackscountry ~ Year + Sex + Age + Species + Time +
#> Type, data = aus_data)
#>
#> Residuals:
#> Min 1Q Median
#> -0.00000000012408 0.00000000000016 0.00000000000024
#> 3Q Max
#> 0.00000000000036 0.00000000000061
#>
#> Coefficients:
#> Estimate Std. Error
#> (Intercept) 519.999999999994884092 0.000000000037882689
#> Year -0.000000000000000186 0.000000000000018813
#> Sex 0.000000000000213282 0.000000000000414516
#> Age -0.000000000000001548 0.000000000000020766
#> Species -0.000000000000005450 0.000000000000032317
#> Time -0.000000000000140862 0.000000000000342595
#> Type 0.000000000000102150 0.000000000000295018
#> t value Pr(>|t|)
#> (Intercept) 13726586376664.34 <0.0000000000000002 ***
#> Year -0.01 0.99
#> Sex 0.51 0.61
#> Age -0.07 0.94
#> Species -0.17 0.87
#> Time -0.41 0.68
#> Type 0.35 0.73
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.00000000000549 on 513 degrees of freedom
#> Multiple R-squared: 0.5, Adjusted R-squared: 0.494
#> F-statistic: 85.4 on 6 and 513 DF, p-value: <0.0000000000000002
# Compute the VIF
vif_values2 <- car::vif(model2)
VIF values:
| Variable | VIF |
|---|---|
| Year | 1.08 |
| Sex | 1.07 |
| Age | 1.06 |
| Species | 1.03 |
| Type | 1.11 |
| Time | 1.07 |
Let’s focus on both countries
# This regression will be based on both USA and AUSTRALIA
combined_data <- merged_map %>%
filter(Country %in% c("USA", "AUSTRALIA"))
# Run the regression based on both USA and AUSTRALIA
model_combined <- lm(Attackscountry ~ Year + Sex + Age + Species + Type + Time, data = combined_data)
original_scipen <- options("scipen")
options(scipen = 1000)
stargazer(model_combined,
title = 'Regression on the main factors influencing shark attacks focused on the USA and AUSTRALIA',
type = 'html',
digits = 3)
| Dependent variable: | |
| Attackscountry | |
| Year | -1.130 |
| (0.754) | |
| Sex | -23.900 |
| (17.500) | |
| Age | -5.850*** |
| (0.677) | |
| Species | -4.170*** |
| (1.100) | |
| Type | -27.600** |
| (13.700) | |
| Time | 10.600 |
| (13.700) | |
| Constant | 3,826.000** |
| (1,514.000) | |
| Observations | 1,999 |
| R2 | 0.050 |
| Adjusted R2 | 0.047 |
| Residual Std. Error | 411.000 (df = 1992) |
| F Statistic | 17.400*** (df = 6; 1992) |
| Note: | p<0.1; p<0.05; p<0.01 |
#Model combined, AIC
# Include the intercept in the initial model
current_model3 <- lm(Attackscountry ~ 1, data = combined_data)
# List to store models at each step
selected_models3 <- list()
# Variables to add
variables_to_add3 <- c("Year", "Sex", "Age", "Species", "Time", "Type")
# Forward selection loop
for (variable in variables_to_add3) {
#Add a variable to the formula
formula <- as.formula(paste(". ~ . +", variable))
current_model3 <- update(current_model3, formula)
# Store the current model in the list
selected_models[[variable]] <- current_model3
# Calculate AIC
aic_value2 <- AIC(current_model3)}
We have done the following type of models of this first regression that will help us to answer this first research question:
Each type of model according to their focus has its own primary factors influencing the occurrence of shark attacks. For example, in the first one (Overall), Age, Year, Type and Time are the factors that influence the attacks. But the two that are highly significant are Age and Year. In the 4th one, the one that is focused on both countries, the main factors are Age, Species, Type and Time. In conclusion, Age, Type and Time are highly significant predictors, thus they strongly influence the occurrence of shark attacks from a statistical point of view. We did not consider models focus on one country because they are not that relevant. We choose to analyze more than 1 country because it makes more sense. We analyzed the two countries with the most shark attacks because we had the most data about these countries. It gives us more details and the significative factors that influence the shark attacks. We reduce the complexity of our model because of the focus on the regions with the most attacks, so it makes our results more relevant.
In this paragraph, we aim to address the research question of what factors contribute to the fatality of shark attacks. The variable we will try to study this time is “Fatality”, which appears to be a binomial variable, where 0 expresses no fatality, 1 expresses fatality. In the EDA we were already able to see that fortunately, most of shark attacks are not fatal, but still, there are some cases where these accidents take off some lives. This investigation is compelling because the identification of these factors can help guide targeted safety protocols and public education initiatives.
The model we will use to answer our question is that of a logistic regression, which utilizes ‘Date,’ ‘Year,’ ‘Age,’ and ‘Time’ as predictors.
\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon \]
where:
filtered_data4 <- merged_data3 %>%
filter(ISO_Code %in% c("ZAF", "USA", "AUS"))
filtered_data4 <- merged_data3 %>%
filter(ISO_Code %in% c("ZAF", "USA", "AUS")
)
filtered_data4$Sex <- factor(filtered_data4$Sex)
model12 <- glm(Fatality ~ Date + Year + Age + Time,
data = filtered_data4,
family = "binomial")
| Dependent variable: | |
| Fatality | |
| Date | -0.007 |
| (0.026) | |
| Year | -0.030*** |
| (0.006) | |
| Age | 0.026*** |
| (0.006) | |
| Time | -0.161 |
| (0.127) | |
| Constant | 56.200*** |
| (12.300) | |
| Observations | 2,319 |
| Log Likelihood | -552.000 |
| Akaike Inf. Crit. | 1,114.000 |
| Note: | p<0.1; p<0.05; p<0.01 |
We will start by analiyzing the significant coefficient.
The intercept of 55.643 is the estimated log-odds of fatality when all predictors are zero. Since, our variable of interest is a binomal one, it can be difficult to interpret this log-odds directly, though, and it might not be clear how it affects the likelihood of death. This value needs a deepen econometric study in order to be comprehended.
The coefficient for ‘Year’ is -0.029, suggesting that the likelihood of a fatal outcome in shark attacks decreases by around 0,29% as the year increases. This trend may be caused by developments in emergency medical response and care, public education and awareness campaigns about shark safety, enhanced beach surveillance and warning systems, or adjustments in recreational activities and behavior near coastal areas.
The ‘Age’ coefficient is 0.027, indicating that for each unit increase in age, the log-odds of fatality increase by 0.0269. One explanation for this correlation may be that older people have less physical strength and agility, which makes it harder for them to flee from or defend against a shark attack. Additionally, older people may be more vulnerable to severe injuries from shark bites. Furthermore, the observed higher risk of death in older age groups may be explained by behavioral factors such as an increased propensity to participate in riskier water activities or an increased amount of time spent in the water.
The last two coefficients, that of Date and Time are not significant, but let is dive into them!
The ‘Date’ coefficient is -0.006, but with a p-value of 0.85, it is not statistically significant, implying that the month of the shark attack may not be a strong predictor of fatality. But this was not a surprise. Indeed, the three countries we took into consideration have seasons that vary throughout the year, based on their position on the globe. For example, Summer in USA goes from June to September, but in Australia and South Africa it goes from December to February/March. This might create a disequilibrium when R reads them.
Remarkably, the ‘Time’ coefficient is 0.044, indicating that the time of day may not be a predictive factor for shark attack fatalities. The lack of significance of ‘Time’ may suggest that the time of an attack by a shark has little bearing on whether it ends fatally. Intuitively, we thought that night attacks would be more fatal, given the difficulty in asking or receiving help, but apparently this is not the case.
The graphs in the section on Exploratory Data Analysis (EDA) show us that the elements impacting climate change are trending upward. Interestingly, there is a comparable rising trend in the incidence of shark attacks. It presents a valid question regarding whether these two developments are correlated and how strong that association is. The problem of climate change is intricate and multidimensional. Three primary variables will be the focus of our investigation: temperature change, sea level rise, and CO2 emissions. Although we recognize that many other factors also contribute to this environmental phenomenon, we will focus on these specific variables and simplify the analysis in this way.
We must include a new column in our dataset that indicates the number of shark attacks that occur annually in each nation in order to answer this issue. We shall be able to investigate their frequency in this way.
\[ \hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon \]
where:
#The first thing we do is creating a copy of our dataset, so that we can take away some information that are not needed here. Indeed, with the na.omit function, we delete all the rows of the years
#after 1992. The reason for this is that we were not able to find a dataset on sea level information that contained information starting from 1970.
data_RQ3 <- merged_data3
data_RQ3 <- na.omit(data_RQ3)
str(data_RQ3) #Firstly, we run the str function to check if all the variables of interest are num or int, and not chr
#Run correlation matrix to be sure that there is no multicollinearity. When we run it, we see that
#all vorrelations are far from being equal to 1 or -1, which is a positive sign.
Before running our regression, it is important to track if there is any sign of multicollinearity in our data. For this reason, we run two correlation plots. The first plot helps us visualize pretty quickly what is the correlation magnitude of the variables analized. The fact that we do not see any dark blue or dark red circle between two different variables, tells us that the correlation is far from being equal to the extremes 1 and -1.
A clearer plot is the second one, that provides the reader with the exact correlation coefficients. The takeaway is that there is no linear dependence among the predictors we have chosen: this observation gives us confidence in the validity of our regression analysis, and opens us the doors to run our regression!
subset_data <- data_RQ3[, c("Temperature", "Annual CO2 Emissions", "GMSL_GIA")]
correlation_matrix <- cor(subset_data, use = "complete.obs")
corrplot(correlation_matrix, method = "circle")
# Customized corrplot for the subset
corrplot(
correlation_matrix,
method = "ellipse",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "gray"
)
# We create a new variable 'SharkAttacksCount' representing the frequency of shark attacks per year
count_shark_attacks <- data_RQ3 %>%
group_by(Year, ISO_Code) %>%
summarize(SharkAttacksCount = n())
#> `summarise()` has grouped output by 'Year'. You can override using
#> the `.groups` argument.
# Then, we merge the aggregated shark attacks data back to your original dataset (merged_data3) based on the 'year' and 'ISO_Code' variables. We do this, so as to be able to run a the regression
data_RQ3 <- merge(data_RQ3, count_shark_attacks, by = c("Year", "ISO_Code"), all.x = TRUE)
# FIRST MODEL
filtered_data <- data_RQ3 %>%
filter(ISO_Code %in% c("USA", "ZAF", "AUS"))
model <- lm(SharkAttacksCount ~ Temperature + GMSL_GIA + `Annual CO2 Emissions`, data = filtered_data)
| Dependent variable: | |
| SharkAttacksCount | |
| Temperature | 4.340*** |
| (0.551) | |
| GMSL_GIA | 0.286*** |
| (0.011) | |
Annual CO2 Emissions
|
0.000*** |
| (0.000) | |
| Constant | 8.000*** |
| (0.562) | |
| Observations | 1,717 |
| R2 | 0.792 |
| Adjusted R2 | 0.791 |
| Residual Std. Error | 8.830 (df = 1713) |
| F Statistic | 2,171.000*** (df = 3; 1713) |
| Note: | p<0.1; p<0.05; p<0.01 |
The results of our regression are really interesting to analyse! The statistical significance of the coefficients related to temperature change, sea level change, and CO2 emissions indicates their important contribution to our predictive model. The complex interaction between environmental dynamics and shark activity is highlighted by the positive correlation found between the annual frequency of shark attacks and these three climate change parameters.
The constant term denotes the baseline frequency of shark attacks when all environmental factors are held constant, and it is equal to 8.00000. This means that, when we hold all the parameters to 0 (more precisely, if there is no detection of climate change), we would still have a positive amount of shark attacks.The interpretation is fairly clear, indicating that shark attacks are obviously not caused by climate change per se.
Our second finding is that the temperature change beta coefficient is 4.340, and the significance level is indicated by three asterisks (***). This highlights a strong and favorable correlation between temperature changes and the number of shark attacks. Similarly, the sea level change beta coefficient, which is marked with three asterisks and registers at 0.286, indicates a significant positive correlation between rising sea levels and an increase in shark attacks. Finally, the beta coefficient for CO2 emissions is very close to zero, still remaining positive. This notifies us that an increase of CO2 emissions leads to a very small increase of shark attacks.
Finally, we also notice that \(R^2\) is equal to 0,792, meaning that, in this context, our variables explain 79,2% of the variation of our variables of interest.
CBS News. (2010, October 24). Five Most Dangerous Sharks to Humans. https://www.cbsnews.com/pictures/five-most-dangerous-sharks-to-humans/
Midway, S. R., Wagner, T., & Burgess, G. H. (2019). Trends in global shark attacks. PLOS ONE, 14(2), e0211049. https://doi.org/10.1371/journal.pone.0211049
West, J. G. (2011). Changing patterns of shark attacks in Australian waters. Marine and Freshwater Research, 62(6), 744. https://doi.org/10.1071/mf10181
World’s biggest CO₂ emitters 2021 | Statista. (2023, September 12). Statista. https://www.statista.com/statistics/271748/the-largest-emitters-of-co2-in-the-world/